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We consider comparisons of statistical learning algorithms us¬ 
ing multiple data sets, via leave-one-in cross-study validation: each 
of the algorithms is trained on one data set; the resulting model is 
then validated on each remaining data set. This poses two statisti¬ 
cal challenges that need to be addressed simultaneously. The first is 
the assessment of study heterogeneity, with the aim of identifying 
a subset of studies within which algorithm comparisons can be reli¬ 
ably carried out. The second is the comparison of algorithms using 
the ensemble of data sets. We address both problems by integrating 
clustering and model comparison. We formulate a Bayesian model for 
the array of cross-study validation statistics, which defines clusters 
of studies with similar properties and provides the basis for meaning¬ 
ful algorithm comparison in the presence of study heterogeneity. We 
illustrate our approach through simulations involving studies with 
varying severity of systematic errors, and in the context of medical 
prognosis for patients diagnosed with cancer, using high-throughput 
measurements of the transcriptional activity of the tumor’s genes. 

1. Introduction. Predictive models, in most cases, need to be validated 
using data from independent studies. In many disciplines it is common for 
research communities to generate multiple data sets that address similar 
prediction problems. The availability of multiple data sets makes it possible 
to systematically compare the performance of alternative statistical learning 
algorithms, and to characterize their strengths and limitations in the context 
of a specific area of application. 

Here, the term learning algorithm is used for any procedure, say, linear 
regression or nearest neighbor classification, that produces prediction rules. 
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We consider the task of assessing learning algorithms, via what we call leave- 
one-in cross-study validation: the algorithm is trained on one data set; the 
resulting prediction model is then validated on each remaining data set, and 
a validation performance statistic (such as the classification error rate or 
the mean squared error of prediction) is recorded. By repeating this over 
all possible training data sets one generates a square array Z of validation 
statistics. Computation of leave-one-in matrices Z is, in most cases, straight¬ 
forward. Our goal is to develop a statistical framework for the analysis of 
leave-one-in matrices. 

Our motivation comes from earlier experience in clinical genomics [Garrett- 
Mayer et al. (2008)] where the goal is to predict individual outcomes based 
on high-dimensional features of the genome. Leave-one-in cross-study vali¬ 
dation is well suited to this context for two reasons. First, while different 
studies address the same prediction question, they may do so using different 
sampling designs or technological platforms, generating heterogeneity that 
makes it difficult to directly combine all data. Second, it is not uncommon 
for studies to be affected by unknown artifactual variation, such as the so- 
called batch effects, making it important to use methodologies that allow 
identification and separate handling of studies that show poor concordance 
with the majority of the rest [Baggerly, Coombes and Neeley (2008)]. 

Our perspective is therefore that cross-study validation should simultane¬ 
ously be concerned about two questions: the identification of heterogeneity 
and outliers among studies, and the comparison of alternative algorithms, 
done in a way that accounts for heterogeneity across studies. We achieve 
this by modeling directly each of the algorithm-specific Z matrices. Vari¬ 
ability in the validation measures contained in a Z matrix may arise from 
several sources, including differences in study design, study populations and 
measurement technologies, as well as accidental causes that may have af¬ 
fected data quality in individual studies. To illustrate, imagine the outcome 
of interest is determined by a different set of predictors in different geo¬ 
graphical areas. A collection of studies may include two major clusters of 
studies, each confined to a given area. Performance evaluations are best han¬ 
dled by considering cross-study validation within each of these clusters, as a 
good algorithm should not be required to generate models that predict well 
across geographical areas when trained on data from a single area. Similar 
considerations apply to clusters defined by technological platforms. 

We propose a two-stage procedure. The first stage addresses sampling 
variation in the Z array via Bootstrap. The second stage infers a latent par¬ 
tition of the studies defined by a Dirichlet process. Studies will be assigned 
to the same subset when the corresponding vectors of validation statistics 
are similar. Conversely, if the Z array provides evidence of heterogeneity 
between two studies, then these will tend to be assigned to separate clus¬ 
ters. Our model achieves two goals: (i) to cluster studies using Z, generating 
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hypotheses on the sources of heterogeneity; and (ii) to provide cluster-based 
summaries of algorithm performance, allowing for comparisons that account 
for heterogeneity and possible systematic artifacts in the study pool. 

Clustering based on the Z matrix is perhaps most attractive in the context 
of prediction problems with a large number of predictors. High dimensional¬ 
ity makes it difficult to spot the important differences between studies and 
to understand the factors hindering cross-study replicability. In this sce¬ 
nario, it is important to provide a solid evaluation of prediction strategies 
using distinct training and validation data sets. This evaluation should be 
rooted in the context of a specific application. The Z matrix helps in this: its 
strengths and limitations arise from reducing the problem to a single figure 
of merit for prediction performance. It is simple to interpret and easy to 
visualize. Also, it is not affected by subtle issues such as over-fitting, batch 
effects and selection of favorable training/testing combinations. The goal of 
our Bayesian procedure is to retain these advantages of the Z matrix, to 
provide an accurate uncertainty analysis and to suggest clusters for further 
inquiry. 

While the motivation and examples for our methodology come from clin¬ 
ical genomics, the only requirement for its application is the availability of 
independent studies using similar approaches to measure predictors. 

2. Bayesian cross-study validation analysis. 

2.1. The leave-one-in validation performance matrix Z. We consider a 
set of S studies, indexed by s and including n s samples, indexed by i. For 
study s, we have measurements on outcomes Y S j and predictors X S) i. Our 
focus is the two-dimensional array of validation statistics Z = (Z s>v ',s,v = 
1,..., S, s ^ v). We use the term algorithm to refer to a training methodology 
(such as CART or ridge regression) and the term model to refer to a specific 
prediction rule, resulting from using the algorithm on a training data set. 
For a given algorithm, the statistic Z s>v measures the predictive performance 
of the model trained on data set s, when validated on a different data set 
v. Typical definitions of Z SfV with binary outcomes include the classification 
error rate and, if the model generates risk scores for binary outcomes, the 
area under the operating characteristic curve (AUC). Validation statistics 
for time-to-event outcomes include versions of the concordance index [Uno 
et al. (2011) and references therein]. Our approach is based on the Z matrix 
and does not include direct modeling of the data at the individual level. 
This choice is motivated by the goal of obtaining easily interpretable results 
with modest computational effort. 

In addition to Z SjV , with s/r, one can also consider the variables Z s<s , 
obtained by standard cross-validation, iteratively splitting the data set into 
training and validation components. Here we do not use the variables Z S)S 
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to avoid summary statistics that might be inflated by systematic errors or 
batch effects. 

2.2. Relation to Bayesian meta-analysis. There are important points of 
contact, as well as differences, between our approach and existing ideas in 
Bayesian meta-analysis. 

Bayesian modeling allows one to easily account for study heterogeneity. 
Several approaches are based on hierarchical models [Berry (1990)]. For ex¬ 
ample, Warn, Thompson and Spiegelhalter (2002) consider S = 31 random¬ 
ized trials for assessing the analgesic Ibuprofen. The data for each study 
consist of sample size, number of individuals randomized to placebo and 
number of events (pain relief) for each arm. Treatment assignments X S j 
and outcomes Y s ^ are binary. They specify a hierarchical model with latent 
parameters 6 S describing success rates in each study and an unknown dis¬ 
tribution F describing variability in the study specific parameters, that is, 
0 S \F ' F. The assumption that, conditionally on these parameters, indi¬ 
vidual observations within each study are independent completes the model. 

Heterogeneity of study-specific parameters is often better understood via 
clustering, as we will propose here. Berry and Christensen (1979) introduced 
the idea of using a Dirichlet prior for F. A practical advantage of the Dirich- 
let process in this context is the resulting discreteness of F. This implies that 
when (6i,... ,9s) are sampled either from the prior or from the posterior, 
one observes clusters of studies: for every pair (s,v) the event 9 S = 9 V has 
positive probability. Thus, one obtains a posteriori the distribution of a la¬ 
tent random partition of the studies {1,..., S} dictated by ties in the values 
of the parameters (0\,... ,0s). While evidence synthesis may average over 
the distribution of this partition, cluster analysis can be performed by se¬ 
lecting a single representative partition. Model-based clustering and the use 
of a latent partition are effective for dealing with questions and hypotheses 
such as (i) the response probabilities are the same across studies, (ii) there 
exists a large group of studies sharing identical response probabilities and 
(iii) there are studies that should be considered outliers. 

2.3. Two-stage analysis. Our validation analysis uses a summary of the 
data, consisting of (i) the Z array and (ii) a parametric estimate d of the 
unknown joint distribution d of the zero mean random variables Z sv — ( SjV , 
where s,v = 1,..., S, s / v, and ( SjV is the expected value of Z StV . The ex¬ 
pected values ( StV = IE p a ,p v (Z S)V ) refer to the true unknown distributions of 
the data P s and P v within studies s and v. These are joint distributions 
including both predictors and outcomes, and might vary across studies. 

Our approach is in two stages. The first stage estimates the dispersion of 
the Z S)V random variables. The second stage is based on a Bayesian model, 
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specified using a Dirichlet prior and the dispersion of the Z' s estimated in 
the first stage. 

We propose a simple hierarchical model for Z that balances (i) the need, 
as in any validation study, of easily interpretable summary statistics that 
are free of questionable assumptions and (ii) the goal of detecting clusters 
of studies and possible outliers. We chose a prior model for Z with a mini¬ 
mal level of complexity in order to avoid difficulties in the interpretation of 
the resulting estimates. Similar to Bayesian meta-analysis, we use latent pa¬ 
rameters for the unknown means of our Z random variables. The posterior 
distribution of these parameters, as discussed in Section 3, allows clustering 
of the studies. The goal of the model is to cluster studies with similar data 
quality, as well as studies sharing similarities in their designs and imple¬ 
mentations. We will first provide a description of our model and clustering 
approach in Section 3, assuming identical sample sizes n± = ■ ■ ■ = ns across 
studies, and then remove this constraint. 

One of the advantages of modeling the Z array is the possibility of esti¬ 
mating, for any pair of studies (s,v), the distribution of Z s>v should both 
studies be performed a second time. Estimates can be derived using the hy¬ 
pothesis that data are newly generated under identical technical conditions 
and that the populations from which samples arise remain identical. When 
the estimates of £ s v are combined with the inferred partition of the studies 
{1,... ,S}, these contribute to interpretation of the observed values in our 
Z array. 

Stage 1. The first stage estimates d, with the goal of obtaining an ap¬ 
proximate Bayesian analysis for the observed Z array. The approximation 
consists of plugging an estimate of d into the Bayesian model (Stage 2) to 
bypass computationally intensive joint modeling of S data sets. A practical 
method is the Bootstrap, either in its frequentist [Efron (1979)] or Bayesian 
[Rubin (1981)] versions. The resulting distribution is representative of the 
sampling variability of the Z statistics. The observed variations across the 
Z's are due to both sampling variability and also to possible differences 
across the study-specific distributions P \,..., Ps ■ 

The only result from the first stage of our procedure that we use in the 
analysis of the leave-one-in array is the estimate d. Alternative estimators 
of d could in principle be used. Here we use the bootstrap because of its 
broad applicability. It can be applied to Z matrices generated by a spectrum 
of training methods ranging from popular machine learning procedures to 
algorithms highly tailored to specific application areas. Also, the bootstrap 
can estimate the variability of a number of possible validation summaries, 
such as the misclassification error rate or the mean squared error, that can 
be used to define Z arrays. Finally, the bootstrap is applicable wether or 
not there exists a probability model consistent with the training algorithm. 
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The Bootstrap [Efron (1979)] for estimating d includes (i) the compu¬ 
tation of the empirical distributions Pi, Ps , which (ii) are then itera¬ 
tively used for obtaining S independent Bootstrap samples, one for each 
study, (Xl i ,Y*.,i<n 1 ),...,(X*.,Y*.-i<n s ), with (X* YL) ~ P s . Here 
we avoid the use of an additional index enumerating Bootstrap iterations. 
At each iteration the validation statistics are computed on the basis of 
(X* i ,Y* i ;i<ni),...,(Xg i ,Yg i ]i<ns), that is, the Z array is resampled. 
At each cycle we compute a prediction model using (XT. YT; i < n s ) and 
then validate it on (X* Y* t ; i < n v ), s^u, to obtain Z* v . Finally, (iii) 
we estimate d by centering the empirical distribution of the iteratively re¬ 
sampled arrays. The Bootstrap estimate of d, as the number of iterations 
increases, converges to the nonparametric maximum likelihood estimate of 
d. In other words, by resampling we approximate the mapping of P\,.... Ps 
to the distribution of ( Z S)V — s, v = 1,..., S, s / v) under the assumption 
that P s = P s for every s < S. 

When d is estimated by the Bayesian Bootstrap, the flow of the pro¬ 
cedure remains identical, with the exception that the initial components 
P\,.... Ps are replaced by random distributions P *,..., Pg. The random dis¬ 
tributions Pf,... ,Pg are defined by P* oc Y, t < n , W s,iI(x Sii ,Y Sti )i where W s>i , 
i <n s , are independent exponential variables with a fixed scale parameter. 
The Bayesian Bootstrap averages over iteratively generated random distri¬ 
butions P*,..., Pg. In this case, the resampling scheme allows us to obtain 
the Bayesian estimate of the Z 1 s dispersion under Dirichlet process priors 
with infinitesimal concentration parameters for P\.... ,P S . 

Stage 2. We specify a Bayesian model for the validation statistics Z. To 
simplify posterior computations, we plug in a zero mean multivariate normal 
distribution d into our model by matching the covariance matrix estimate 
from the Bootstrap algorithm in the previous paragraph. This choice, in 
several cases, is justified by convergence of the actual joint distribution of 
the validation statistics Z, for large sample sizes, to a Normal density. We 
will provide examples of such convergence. 

We introduce an exchangeable random partition n = {C \,..., C m } of 
{1,...,S}, where Cj, j = are groups of studies. The number of 

clusters m is a random variable. The random partition n of {1,...,S'} is 
specified by S exchangeable variables sampled from a discrete random dis¬ 
tribution; the Dirichlet process is an example. We refer to Lee et al. (2013) 
for an overview on exchangeable partitions. We use C(s) for indicating the 
subset of the partition n that includes study s. Also, we use pn to denote 
the law of the random partition. We state the probability model for Z\ it 
includes a latent partition and a set of random variables = 1,...) 

which play a role similar to the atom locations in a Dirichlet process mixture: 
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n ~ pn 


( 2 . 1 ) 


e = (e S) „; s, v = 1,..., S, s ^ v) ~ d and 
%s,v = 9C(s),C(y) “I" £s,vj S,V = 1, . . . , S, S ^ V, 


where the components pi, II and e are a priori independent and p^ is a 
distribution on the real line. 

The probability that the conditional expected values of a pair (s,v) of Z 
columns (or rows) are identical is strictly positive: 



P 


■r<S 


Also, the distribution of the array {pc(s),C[y) ; s, v = 1,..., S, s ^ v) is invari¬ 
ant with respect to any permutation a = (ay,..., as) of {1,..., S}, 



The model can handle an arbitrary number of additional studies (S + 
1, S + 2,.. .). Therefore, one can perform predictive inference by considering a 
future (S + l)th study and obtain, conditionally on the observed Z statistics, 
the distribution of {pc(S+i),C(s),PC(s),C(S+iy,s= T • • ■ ,5). 

Arrays with exchangeable rows and columns have been studied in a series 
of papers beginning with the contributions of Aldous (1981) and Hoover 
(1982). These authors proved de Finetti-type representations for these pro¬ 
cesses. Random arrays invariant in distribution to any simultaneous permu¬ 
tation a of rows and columns, such as {pc(s),C{v)'i s i v — 1); are called jointly 
exchangeable. This type of arrays arises, for instance, when relationships be¬ 
tween individuals are represented using two-way tables [Roy and Teh (2009)]. 
In our study, these representation theorems provide a formal justification to 
use latent cluster membership variables for modeling exchangeable arrays. 

2.4. Asymptotic normality of validation arrays. The proposed model for 
Z is closely connected with Dirichlet process mixtures. Consider, for ex¬ 
ample, S studies designed for estimating 6 S =E(Y Si j). A possible approach 
for exploring the hypothesis of multiple clusters defined by studies with 
identical means 0 S consists in combining approximate likelihood functions 
N(Y S = Y S} i/n s ;9 s , d‘^/v / n s ) with a random distribution F for the means, 

that is, 9 S \FF. See Burr and Doss (2005) for a detailed study of this ap¬ 
proach, and Dersimonian and Laird (1986) for a frequentist perspective. The 
approximation, from a Bayesian standpoint, consists in using Normal kernels 

with scale parameters y^A(Y S) j — Y s ) 2 /n s , and is supported by asymptotic 
arguments. Similarly, we combine an exchangeable random partition with 
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a multivariate Normal kernel d justified, in several cases, by asymptotic 
arguments. 

A smooth estimate of d is computationally convenient and circumvents 
artifacts that arise with a discrete one, including the possibility of poste¬ 
rior distributions assigning exactly null probability to most of the II config¬ 
urations. One can identify several cases in which the leave-one-in array is 
asymptotically Normal. Below we briefly discuss one case where Z converges 
to a multivariate Normal distribution on a linear subspace of We 

discuss results for logistic regression, Poisson regression, proportional haz¬ 
ards models and support vector machine procedures in the supplementary 
material [Trippa et al. (2015)]. 

Consider the linear model Y S \X S ~ N{X s /3 s ,Ia 2 ), with (Y s ,X a ) = (Y Sji ,X Sti] 
i <n s ), least squares estimates f5 s and mean squared errors (MSE) of pre¬ 
diction 


v \\Y v -X v p a \ 

^S,V — 


n v 


Here, and in all the examples in the Supplementary Material, we let all sam¬ 
ple sizes grow at the same rate, n s ~ c s ni, s = 2,..., S, and fix C 2 ,..., eg. 
Independence of \\Y V — X v fi v \\ 2 and (X v ,j3 v ) implies, under mild assump- 
tions on the X S i distributions, asymptotic normality. First, n v (||Y^ — 
X V /3 V || 2 — n v a 2 ) —> N( 0, 2a 2 ). Next, to obtain Z s>v , we need to add to the in- 
sample mean squared error n~ l \\Y v — X v f3 v \\ 2 a second term, n~ 1 [\\X v {j3 v — 
P s )\\ 2 + 2(8 V -5 S )X V X V (/3 V -/3 S ) + \\X V (5 V - 5 S )\\ 2 }, with 8 V = 0 V -/ 3 V ). It can 
be shown that n v l ^ 2 (8 v — 8 S )[X V X V (/3 V ~ Ps) — E(A\,A ' V (/3 V — /3 S ))] -A 0 and 
n v — 8 S )X V X V (8 V — 5 S ) -A 0. Finally, both n v 1 ^ 2 (8 V — 6 S )K(X V X V (/3 V — 
P s )) and n v ' (P v — f3 s )(X v X v — K(X V X V ))(/3 V — f3 s ) converge to normal den¬ 
sities. Asymptotic joint normality for Z follows from the asymptotic inde¬ 
pendence of S v and n v 1 ^ 2 (X V X V ). 


3. Cluster-based validation statistics. The procedure we propose gen¬ 
erates a posterior distribution p(Jl\Z) for the unknown partition n of our 
S studies. The tuning of the distribution pn and approaches for selecting 
the prior model are discussed in the supplementary material [Trippa et al. 
(2015)]. A representative partition summarizes the posterior distribution. 
We select an estimate II that minimizes the expectation of a loss function 
I(II,n), that is, II = argminE(Z(-,n)|Z). The partition II is a posterior point 
estimate. Quintana and Iglesias (2003) give a discussion on the decision the¬ 
oretic paradigm applied to random partitions. Several loss functions Z(II,n) 
have been proposed; see, for example, Denceud and Guenoche (2006). 
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We use the easily interpretable maximum transfer metric ; see Charon 
et al. (2006) for a recent contribution. This metric /(IR,!^) is defined 
as the minimum number of elementary corrections necessary to match the 
partitions Ill and II 2 ; an elementary correction consists of moving a unit 
to a different (possibly empty) subset. If we consider, for example, IIi = 
({1,2},{3,4}) and II 2 = ({1,4},{2,3}), then Z(IIi,II 2 ) = 2, and a possible 
chain of corrections is ({1,2}, {3,4}) —> ({1,2,3}, {4}) —> ({1,3}, {2,4}). 

Our procedure tends to assign studies to separate clusters when they differ 
on aspects that affect the validation statistics Z. The dissimilarity captured 
by the clustering method might be due to different measurement techniques, 
different predictors distributions or other factors varying across studies. In¬ 
terpretation of the inferred partition requires subsequent analyses to identify 
the primary causes of heterogeneity, such as data quality or experimental 
designs. The results can then inform the construction of models trained on 
multiple data sets. If, for instance, heterogeneity is driven by different dis¬ 
tributions of relevant predictors, but the covariates effects on the outcome 
are consistent across studies, then it might be appropriate to combine the 
available data sets. In contrast, if heterogeneity is driven by measurement 
errors or batch effects, additional efforts may focus on data normalization 
steps. 

We can now introduce the concept of clustering-based validation perfor¬ 
mance measure, by which we mean summary statistics aimed at assessing 
cross-study prediction taking into account study heterogeneity and within- 
cluster similarities. Recall that model 2.1 formalizes the identity between 
the conditional expected values of Z s>v and Z s /y when study s clusters to¬ 
gether with s' and v clusters with v'. For example, we may be interested 
in the performance measure obtained when one trains on any of the studies 
in the cluster of study s and validates in any of the studies in the cluster 
of study v, that is, Hc{s),C(v)- The latent variable C(s) indicates the clus¬ 
ter that includes s and Hc(s),C(v) can be interpreted as the expectation of 
Z S)V assuming that studies s and v are repeated de novo. A point estimate 
®(Mc(s),C(b) \Z) can be obtained by averaging lE(lic(s),C('u)|n, Z) with respect 
to the posterior distribution of the partition II. Similarly, one may derive 
interval estimates. 

We can also estimate the validation performance that one would ob¬ 
tain from training in a study from the set C(s), and validating in a future 
(5 + l)th study, by using Hc(s),c(S- 1 - 1 ) • In particular, the joint posterior dis¬ 
tribution of t L c(s),c(S+ 1 ) an d t l c{v),C(S- 1 - 1 )j with s,v < S , can be used for 
comparing studies s and v. 

Let B be a subset of studies in {1,..., S}. We extend the definition of the 
validation statistic Z S)V to handle the case where a model is trained on the 
combination of the data from all the studies in B, and then validated on 
study v. We denote the resulting validation statistic by Zb, v - If B includes 
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v, then v is not used to train the model, and Z B)V is redefined to be the 
same as Z B \ VV , where B \v = {s < S: s € B and s 7 ^ v}. We also use B(s) C 
{1,..., S} to denote the studies within the same II latent cluster of s, that 
is, S(s) = {u<S:C(s) = C'(u)}. 

Clustering has the goal of identifying homogeneous groups of studies with 
similar sampling distributions. When this works, it is natural to train models 
by combining the studies in a cluster. However, the figure of merit used for 
the Z summary, not unlike a loss function, implies adopting a specific one¬ 
dimensional perspective in looking at the data. It is possible, for example, 
that two studies with different covariate distributions might be clustered to¬ 
gether, or two studies which only differ in design, but not in the populations, 
may be allocated to separate clusters. 

Clustering can be used to estimate the performance obtained when val¬ 
idating in study s after training on studies in B(s), that is, using only 
data sets similar to s. This task reduces to estimating Z B ^ S . The function 
B —>• Z B ,si over the collection of {1,...,S} subsets, can be directly com¬ 
puted using our S data sets and is not related with the Bayesian model, 
but Z B ( S ^ S , the value of this function at B(s), is estimated because B(s) is 
an unknown latent component of the model. This approach is only useful 
when there is no strong evidence that s belongs to a singleton cluster. We 
thus estimate Z B ^ S by using the posterior distribution of the partition n 
and conditioning on the event B(s) 7 ^ {s}. We report both the estimate of 
Z B ( s ),s obtained by averaging over n configurations with B(s) / {s} and the 
posterior probability of the conditioning event B(s) 7 ^ {s}. Alternatively, we 
can generate a plug-in estimate Z B ,^ g by focusing on B(s), the cluster in 
II that includes the sth study. 

When we estimate Z B r s ^ s the goal is to evaluate a model trained by a 
homogeneous set of studies B(s). Our clustering procedure uses validation 
statistics to detect study heterogeneity, and therefore the resulting partition 
is representative of differences between studies captured by the Z valida¬ 
tion summaries. Studies included in the same cluster could still differ in 
important ways. We consider this point further in the discussion. 

For comparing studies, we also need to be concerned about the poten¬ 
tial for variations of clustering-based summaries, such as Z B ^ S , driven by 
different total sample sizes within each cluster. Under the assumption of 
identical sample sizes n\ = ■ ■ ■ = ns, which will be later removed, one can 
expect that the value Z B ^ S \ S improves with the number of studies in B(s). 
We thus define the sample size adjusted validation statistics Z B g . The defi¬ 
nition of these statistics is analogous to that of Z BtS . We randomly select j 
distinct samples from the ensemble of studies B. We train a model on these 
j samples and validate it on data set s to generate a performance measure, 
say, an AUC. We iterate this procedure, keeping fixed both B and s; Z J B s 
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is the average of the accuracy measures obtained during these iterations. In 
this case, if B includes s, then the units in s are not selected for training 
the model. The index j can vary from a minimal size of interest up to the 
overall number of samples in B\s. 

Our interest is in the map j —> Z B ^ g ; recall that B(s ) is unknown but 
can be estimated using the posterior distribution of II. The statistics Z B ^ g 
have an interpretation similar to Zb( s ), s \ moreover, one can contrast the 
estimates of Z J B ^ g and Z B ^ y to compare the sth study to the uth study. 
We can estimate Z B ^ s plugging in the point estimate II or directly using 
the posterior distribution of II. If we follow the first approach, the estimator 
is Zg ^ , while the second approach averages with respect to the posterior 

distribution of B(s). In both cases we estimate, assuming Y1b(s) n v > 3 + n s, 
the mean value of the validation statistic when the algorithm is trained by 
j data points from the unknown subset B(s)\ s and then validated on s. 
In the second case, we report the posterior probability of J2b(s) n v > J + 

and compute our estimate conditionally on this event because Z B ^ g is well 
dehned only when B(s ) includes at least j + n s units. 

4. Simulation study. 

4.1. Scenario 1. The goal of this simulation study is to illustrate the 
extent to which our model-based approach contributes to the interpretation 
of cross-study validation statistics, beyond what can be learned from direct 
visualization of Z. As this relies on estimating the unknown partition II 
and the latent Hc(s),C(v) variables, we also discuss our model’s ability to 
reconstruct these. 

The scenario is defined by 9 studies grouped into three clusters, C\ = 
{1,2,3}, C 2 = {4,5, 6 } and C 3 = {7, 8 ,9}, which differ in the amount of mea¬ 
surement error in the predictors. All studies have a sample size of 300. For 
subject i from study s we have a binary outcome Y s l and 50 candidate 
predictor variables X s ^. In group C\, the 50 covariates are simulated from 
a multivariate Normal distribution with null mean; all variances are equal 
to 17. The dependence between X s j and Y s _ t , s = 1,2,3, is specified by a 
logistic regression function; 10 regression coefficients are equal to 0.1 and 40 
are equal to 0. In group C 2 we add independent measurement errors with 
null mean and standard deviation equal to 14 to 50% of the covariates. In 
C *3 we add independent measurement errors with mean 0.33 and standard 
deviation 8 to all covariates. 

For each study we obtain a prediction model by fitting a logistic function 
using ridge regression; we tune the penalization parameter with standard 
cross-validation. We then assess model performance using the mean abso¬ 
lute error (MAE) of prediction, that is, Z s>v = n~ l JT \\Y V ^ — logit” 1 (/3° + 
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PsX v ,i )||, where (j3°,j3 s ) denote the regression coefficients estimated using 
only data from study s. 

Figure 1(A) shows the Z array for a single simulation, with rows cor¬ 
responding to training data sets and columns corresponding to validation 
data sets. This array shows that sampling variability accounts for a rele¬ 
vant part of the observed differences across validation summaries, and the 
resulting panel is not easily interpretable by direct visual inspection. Fig¬ 
ure 1(B) shows Monte Carlo approximations of the true expected values 
Cs,v °f the Z s>v variables under the described sampling models. The ex¬ 
pected value ( SjV is computed integrating with respect to the actual distri¬ 
butions (P S ,P V ) of (X Sj j, y Sj j) and (X v> i,Y Vt i). Figure 1(C) shows the cluster- 
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Fig. 1. Leave-one-in array, with rows corresponding to training data sets and columns 
to validation data sets. Panel (A) shows the leave-one-in array Z for a single simulation. 
Panel (B) shows the true expected values £ s ,„ of Z StV . Panel (C) shows the Bayesian es¬ 
timates W.(fj,c(s),c(v)\Z). The diagonals in panels (A), (B) and (C) are blank. Panel (D) 
considers 500 simulations and plots the empirical estimates Z SlV against the Bayesian es¬ 
timates ¥.(nc( s ),c(,v)\Z). Panel (D) considers a training data set s in C\ and a validation 
data set v in Ci- The green lines correspond to the true expected value £ s ,„. Panel (D) 
also reports the MSE ratio contrasting the Bayesian estimates with the empirical esti¬ 
mates. Panel (E) contrasts the Bayesian estimates of with the empirical estimates by 
displaying the MAEs. Panel (E) considers all combinations with s and v in Ci,C 2 or C 3 . 
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based Bayesian estimates ^{lJ-c{s),C(v)\Z) based on our two-stage procedure. 
In this simulation, our clustering procedure gives a point estimate II = 
[{1}, {2,3}, {4,5,6}, {7,8,9}] of the latent partition. The distance Z (fl, IItrue)) 
measured with the maximum transfer metric, is equal to 1. 

Comparison of panels (A) and (C) shows that the two-stage procedure 
correctly reconstructs the block structure of the true expected values ( SjV dis¬ 
played in panel (B). Also, the procedure correctly identifies a group of stud¬ 
ies, which are not affected by measurement errors, with estimated fJ>c(s),c(s) 
value below 0.2. 

We repeated the simulation 500 times. In each iteration, and for each pair 
(s,v), we estimated the unknown means using our Bayesian estimator 
E(a*c(s),C(«) \Z) an d the empirical estimator Z SyV . The results are plotted in 
Figure 1(D) against each other for a single (s,v) combination, with s in C\ 
and v in C 2 . Then, for each (s,v) combination, we contrasted the MSEs and 
the MAEs of the Bayesian estimates with the empirical estimates. Across all 
(s,v) combinations the Bayesian estimator has lower MSE and MAE than 
the empirical estimates. These results are graphed in panel (E); each point 
corresponds to one (s,v) combination, and the MAEs of the Bayesian and 
empirical estimates are plotted against each other. In this comparison the 
Bayesian estimator achieves a substantially lower dispersion around the true 
expected value Cs, v compared to the empirical estimator. 

For each simulation we computed /(II, IItrue)) the number of elementary 
set operations between the true and estimated latent partition. On aver¬ 
age this distance is 1.63 and, in most iterations, II has a distance of 2 set 
operations or less from II. 


4.2. Scenario 2. We consider a sampling model previously used in Wal¬ 
dron et al. (2011). We use it to investigate how the comparison of alternative 
algorithms is enhanced by Bayesian modeling of the Z arrays. Here we add 
measurement errors to the outcome variable in subsets of studies. We in¬ 
vestigate how modeling of Z allows algorithm performance assessment for 
continuously varying training sample size. The main focus is on the maps 
j —> Z 3 b ^ s to contrast methods. We also highlight how posterior inference 
on clustering based statistics, such as the estimates of Hc(s),C(v) , captures 
uncertainty on the algorithms’ performances. 

We simulated 540 zero-mean Normal predictors X s j with a covariance 
matrix structured in blocks: 


a l,j 


= 


1 , 

0 . 2 , 

0 . 2 , 

0 . 2 , 

0 , 


if 1=3, 

if l,j < 100 and l / j, 
if 100 <l,j< 200 and l / j, 
if 200 <l,j < 370 and l / j, 
otherwise. 
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Conditionally on these predictors, we then generated binary outcomes Y s j 
with E(y Sj j|X Sj j) = [1 + exp(—/3X S) j)] _1 . Here i<n s = 100 and s = 1,..., 9. 
The regression coefficients (/?i,... ,/I 540 ) are f3j = 0.2 for j < 370 and /3j = 0 
for j > 370. Departing from the sampling model of Waldron et al. (2011), 
we added measurement errors to Y S) i- by changing the value of Y s i with 
probability 0.05 in C\ = {1,2,3}, 0.25 in C 2 = {4,5,6} and 0.5 in C 3 = 
{7,8,9}. These probabilities are independent of Y s and X s . Note that any 
classification approach applied to studies s = 7, 8 ,9 has an average error rate 
of 0.5 because the binary outcomes Y S i, after measurement errors, become 
independent from the covariates and E(Y^ j|X s j) = 0.5. 

We consider for illustration three classification methods: LASSO regres¬ 
sion, ridge regression and a linear support vector machine; penalization pa¬ 
rameters are tuned with cross-validation. We choose our validation statistics 
to be the classification error rates. For each study s, we computed the true 
clustering-based Z B ^ g statistics; in simulation studies the true latent par¬ 
tition, as well as B(s), s = 1, ... ,S, is known. If, for instance, s = 1 , then 
B(s)\s = {2,3}, and Z B ^ s measures the average classification performance 
obtained when a model is trained by j < 200 records randomly sampled from 
B(s)\s. The classification performance is obtained through empirical vali¬ 
dation on data set s. 

We then used the posterior distribution of B(s)\ s and computed the 
estimates E(Z J B ^ JZ, J^ v eB(s)\s n v — j)- The first three panels in Figure 2 
contrast Z B ^ g (dashed lines) with the Bayesian estimates (solid lines); each 
color corresponds to one of the three methods. Overall, the estimates cor¬ 
rectly portray the differences that exist between the performances of the 
three algorithms; in this scenario, the support vector machine slightly out¬ 
performs ridge regression, which, in turn, has lower prediction errors than 
LASSO. These differences are shown in the third row of Figure 2 where 
we plot the maps j —>• Cs, with equal to the expected value of Z 3 g ^ s . 
The second line of panels in Figure 2 shows the posterior probabilities 
p(C(s ) = C{v)\Z). In this example, the proposed model captures the un¬ 
derlying partition of the 9 studies and the differences across methods’ per¬ 
formances. 

We repeated the simulation 500 times, generating 9 independent data sets 
for each iteration. In the bottom three panels of Figure 2, we show medians 
and quartiles of the Z J B ^ s posterior estimates, for j = 100,200, obtained 
across these 500 iterations. These are compared to approximations of the 
true maps j —>• ( J s , obtained by averaging the true error rates Z B ^ s across 
simulations. These maps are displayed with solid lines in the third row of 
panels in Figure 2. These panels summarize the distribution across sirnu- 
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Fig. 2. Clustering based validation statistics. The top row considers a single simulation 
and compares the true values of the validation statistics s (dashed lines) with our 

Bayesian estimates (solid lines) for varying size of training sets. Using the same data, 
the second row displays the posterior probabilities that two studies, s and v, are clustered 
together. The third row summarizes results from 500 simulations; solid lines display the 
true expected values of Z J B ,, s , while dots marks medians and quartiles of the corre¬ 
sponding Bayesian estimates at j = 100 and j = 200 across simulations. Colors denote the 
three algorithms. In the 1st (2nd, 3rd) column s = l (4,7). 


lations of the estimated clustering validation measures s an< ^ confirm 

that the estimates are representative of the performances of the algorithms 
being compared. 
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Table 1 

The nine ovarian cancer data sets considered in this study. We only considered late-stage 

serous tumors from these studies 


s 

Study 

n s 

Microarray platform 

1 

Bentink et al. (2012) 

117 

Illumina Human v2 

2 

Crijns et al. (2009) 

157 

Operon Human v3 

3 

Yoshihara et al. (2010) 

110 

Affymetrix hgug4112a 

4 

Bonome et al. (2008) 

185 

Affymetrix hgul33a 

5 

Tothill et al. (2008) 

139 

Asymetrix hgul33plus2 

6 

The Cancer Genome Atlas Research Network (2011) 

420 

Asymetrix hthgul33a 

7 

Mok et al. (2009) 

53 

Asymetrix hgul33plus2 

8 

Konstantinopoulos et al. (2010) 

42 

Asymetrix hgu95av2 

9 

Dressman et al. (2007) 

59 

Asymetrix hgul33a 


5. Application to survival prediction in cancer. We illustrate an applica¬ 
tion to the development of a prediction model for overall survival of ovarian 
cancer patients using microarray gene expression data. Ovarian cancer is 
the most lethal gynecological cancer, and numerous groups have undertaken 
microarray experiments to measure tumor gene expression for development 
of prognostic models of patient survival. It is widely accepted that gene sig¬ 
natures proposed for clinical application must be validated on independent 
data sets. In this area of research several strategies and methods have been 
proposed for prediction. Which one works best? How much uncertainty is in¬ 
volved in ranking methods? Posterior probabilities on the Hc{s),C(v) random 
variables are suitable for answering these questions. 

We identified nine previously curated studies utilizing five different mi- 
croarray platforms, each providing patient overall survival for at least 40 
late-stage, serous-type, ovarian tumors (Table 1). Microarray data were pro¬ 
cessed using standard normalization methods, after which probe identifiers 
were mapped to standard gene symbols, as provided by the curatedOvar- 
ianData library [Ganzfried et al. (2013)]. Only gene symbols represented on 
all platforms were considered for across-platform comparability. We noted 
that limiting consideration to those genes present across all platforms has 
a negligible effect on prediction performance. For example, we separately 
fitted Cox models with ridge penalty and estimated with cross-validation C- 
statistics, separately considering one study at a time; the average decrease 
of the C-statistics when only genes present in all platform were considered 
compared to using all available genes was less than 0.01. 

5.1. Accounting for different sample sizes. The sample size n s varies 
across studies. One can therefore expect higher values of the validation 
statistics Z SjV for those models trained in the largest studies. To prevent 
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this from creating artifactual clusters of studies, we apply an intuitive cor¬ 
rection. 

We selected a threshold of 110 samples and considered the 6 studies that 
have a sample size larger than the threshold. We then computed the em¬ 
pirical estimates Z 1W = (Z^ *,°; s, v = 1,2,3,4,5,6 ,s / v). The computation 
of Z\™ is straightforward. We iterate two steps: (i) we train a prediction 
model M 410 with 110 data points sampled without replacement from the 
sth data set, then (ii) we validate the resulting model on the entire utli data 
set. We set Z 440 equal to the average value of the validation statistics across 
iterations. The statistic Z 440 estimates the performance of a model trained 
by 110 samples from P s . We computed these estimates with 200 iterations. 

The covariance matrix E 440 of Z 440 is then estimated by bootstrapping. 
The array Z 110 and E 440 are used for obtaining the posterior distribution of 
the random partition II 110 with the model proposed in Section 2; we only 
replace (Z, II) with (Z 440 ,n 440 ). The reported probability that two studies, 
say, s = 1 and v = 6 , belong to the same cluster is provided by the posterior 
distribution p(II 440 |Z 440 , E 440 ). 

Next, we need to extend this posterior, which refers to the 6 studies we 
selected, to the remaining 3 which have less than 110 samples. To achieve 
this goal, we compute p(n 42 |Z 42 , E 42 ) by reducing the threshold from 110 
to 42, and report the following adjusted random partition: 


p(n 42 = vr|Z 42 , E 42 ) x p(n 110 = A 110 (7r)|Z 110 , E 110 ) 
1(A 11( V) = A 110 (7r))p(n 42 = tt'|Z 42 , E 42 ) 


where the sum is over possible partitions of the 9 studies and the operator 
A 110 projects them into partitions of the 6 studies {1,2,3,4,5, 6 } above the 
110 samples threshold. Two of these 6 studies (s,v) are clustered together 
by 7 r if and only if they are clustered together by A 110 ( 7 t). Expression (5.1) 
implies p(A 110 (n) = •) =p(n 110 = -|Z 110 ,E 110 ). 

This correction for sample size effects preserves the interpret ability of 
the clustering algorithm. It also avoids more complex constructions, such as 
replacing the latent random variables /i in ( 2 . 1 ) with latent functions for 
sample size-specific average validation statistics. The most computationally 
intensive stage of the procedure is the computation of E 42 and E 110 ; the 
arrays Z 42 and Z 110 have been resampled 1000 times. 


5.2. Comparative analysis of prediction methods. Ovarian cancer stud¬ 
ies for developing prognostic signatures are commonly based on two dis¬ 
tinct groups of data sets, a training group, which in most cases only in¬ 
cludes a single data set, and a group of publicly available validation data 
sets. A recent example that presents key questions related with our study is 
Kang, D’Andrea and Kozono (2012), and the subsequent comment Swisher, 
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Taniguchi and Karlan (2012). The goal in Kang, D’Andrea and Kozono 
(2012) is to develop a molecular score based on expression of 151 genes that 
are involved in platinum-induced DNA damage repair to predict response 
to chemotherapy. This exemplifies using a biological hypothesis to preselect 
predictors for constructing prognostic models, thus avoiding some of the 
challenges arising in the “large p small to” setting. In Swisher, Taniguchi 
and Karlan (2012) authors point out that both independent validations and 
a suitable sample size of the validation data set are essential for assessing 
prediction models. 

Prescreening the space of predictors has the advantages of parsimony and 
interpretability, but comes at the cost of some information loss. Our goal 
in this section is to quantify this trade-off using cross-study validation. We 
use the truncated C T statistic as proposed in Uno et al. (2011), truncated 
at r = 3 years, for measuring survival prediction accuracy. The C T statistic, 
given a prediction model M and independent (possibly censored) survival 
data with covariates (Y),AQ), i = 1,... , to, from an unknown distribution P , 
estimates the conditional probability P(r n+ \ > r n+ 2 \Y n+ i < Y n+ 2 ,Y n+ i < r). 
The random variables (r n+ i, r n+ 2 ) are risk scores computed from M based 
on individual covariates (X n+ \,X n+ 2 )] if, for instance, M is a proportional 

hazards model with coefficients j3, then r n+ 1 = f3X n+ i and r n+ 2 = f3X n+ 2 . 
The estimate converges, under the assumption of noninformative censoring, 
to the unknown conditional probability. It is only required that the censoring 
cumulative distribution function remains below 1 at r. 

We applied our method with prediction models constructed using several 
approaches. The first one is a direct application of survival ridge regression, 
using available gene expression data, under the assumption of proportional 
hazards with a linear link function. The second is similar to the approach 
followed in Kang, D’Andrea and Kozono (2012); we only use available gene 
expression data within the selective list proposed by the authors. Note that 
we do not attempt to reproduce their study; we follow a similar strategy. 
Also, in this case prediction models were derived using penalized maximum 
likelihood. Additionally, to these two approaches we also attempted the use 
of Kernel-based methods for estimating a smooth nonparametric link func¬ 
tion [Li and Luan (2003)]. This produced results (i.e., values of the C T 
estimator) clearly inferior to the first two approaches. 

Our goal is to show that the cross-study validation approach we present 
here facilitates methods comparison by estimating the easily interpretable 
/r latent variables. All the analyses were repeated separately under each 
method. Modeling of the Z array, in this example, produces an appreciable 
reduction of the uncertainty on the p latent variables compared to the direct 
computation of the credible intervals by bootstrapping. All the model-based 
estimates of the p latent variable are shrunk toward the average of the Z 
entries. 
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Fig. 3. Validation analysis based on C r statistics. The left panel considers ridge regres¬ 
sion based on all available gene expression data, while the right panel considers only a list 
of genes selected on the basis of the proposal in Kang, D’Andrea and Kozono (2012). Each 
colored point illustrates a validation statistic; colors indicate the training data set s 
while the integers in gray indicate the validation data set v. The symbols indicate the 
corresponding Bayesian estimates Ec(u) \Z 110 ). The dashed lines are 80% confidence 
intervals of the unknown means K(Zl)O) obtained by Bootstrapping. The “* ” symbols in¬ 
dicate 80% credible intervals obtained from the posterior distribution of Pcm c(v) given 
Z 110 . 


Figure 3 shows the observed validation statistics . As mentioned, 
these are empirical estimates of predictive performances adjusted for sam¬ 
ple size variability. Each panel corresponds to one of the two approaches 
we compare, and colors indicate the training data sets, while the integers 
displayed in grey indicate the validation data sets. The plots show the 80% 
confidence intervals of the unknown means E (Z}. 1 ®) obtained by bootstrap¬ 
ping (dashed lines). They also display the model estimates (marked with 
the symbol) of the Mc(° s ),g>) variables and the 80% credible intervals 
(marked with the “*” symbol). Under both approaches all variables 

are estimated within the (0.5,0.6) interval. Our comparison suggests that 
models fitted after upfront selection of a subset of genes based on biological 
hypothesis perform worse than using all gene expression data. This indicates 
that genes other than those involved directly in DNA damage repair can con¬ 
tribute to explaining survival of ovarian cancer patients. All comparisons of 
the Bayesian estimates E^^, C ( v )\Z 110 ) under the two approaches are con¬ 
sistent with this evaluation. We also compared the posterior distributions 
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of c(y) un der the two approaches; at each pair (s,v), when we sample 
from the posterior distributions we obtain inferior values for the 

selective approach with probability above 0.67. If we use all genes, we obtain 
a probability of 0.78 that all At^) c(v) are l ar S er than 0.5, meaning that all 
models perform better than assigning risk scores completely at random. 

The posterior distribution of the latent partition of the 6 studies with 
sample sizes above 110 suggests the existence of two clusters, one includ¬ 
ing studies 4 and 5 and the other with all remaining studies. The estimate 
of the latent partition is identical under the two considered approaches for 
constructing predictive models but needs to be combined with relevant un¬ 
certainty. In particular, in both cases the partition constituted by a single 
degenerate subset with the six studies together accumulates posterior prob¬ 
abilities from both approaches above 0.15. When we added to the analysis 
the remaining three studies with sample sizes below 110, the resulting prob¬ 
abilities of the degenerate partition remained above 0.12. In summary, we 
found moderate evidence of a nondegenerate partition with heterogeneous 
subgroups. 

In order to interpret the latent partition estimate of our leave-one-in anal¬ 
ysis, we computed clustering-based validation statistics. Each solid line in 
Figure 4 is representative of a data set s and illustrates, for hypothetical sam¬ 
ple sizes j from 100 to 600, estimates of how well outcomes in study s can be 
predicted by randomly selecting j data points within the cluster containing 
s. More formally, the y axis shows estimates of Z J B ^ g with j = 100,..., 600. 

In this example the reported probabilities of the events Yl v eB(s)\s n v — 7 are 


Validation datasets: 
- 1 



Sample size 


Fig. 4. Sample size adjusted validation statistics. Solid lines display estimates of the 
clustering-based statistics Z J B ^ s for values of j ranging between 100 and 600. Dashed 
lines display the validation statistics Z ^ S y g , that is, the cluster B(s) is replaced with 
the entire collection of 9 studies. Each color corresponds to a specific validation study s. 
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all above 0.6 for j < 300. These estimates are contrasted (dashed lines in 
Figure 4) with Zj^ S y g . These are estimates of how well outcomes in s can 
be predicted by randomly selecting j data points from all available studies. 
We observe little difference between the solid and dashed lines. This similar¬ 
ity suggests that the clustering is not driven by heterogeneous data quality 
levels across studies (i.e., there is no evidence of clusters that produce pre¬ 
diction models with poor performance). Clustering is driven by studies 4 
and 5, in which, due to covariates distributions, it appears relatively easier 
to achieve C-statistics above 0.6 compared to all other studies. Clustering- 
based statistics in Figure 4 suggest additional samples, above 600 and above 
the overall number of samples from the nine studies, might significantly con¬ 
tribute obtaining better prediction models. 

For a comparison, we fitted the data sets with a hierarchical propor¬ 
tional hazards model, with studies clustered through a Dirichlet process, 
and Normal marginal priors for the regression coefficients. The prior assigns 
a vector of regression coefficients to each cluster of studies, while coefficients 
are independent across clusters. To facilitate the comparison, we tuned the 
Dirichlet process to match the estimate of the number of clusters in our 
leave-one-in analysis. We used the list of possible values for the latent parti¬ 
tion and approximated the posterior using the approach discussed in Sinha, 
Ibrahim and Chen (2003). Our interest is in comparing the latent parti¬ 
tions obtained using the model just described to those from the validation 
analysis. If clustering in the leave-one-in analysis is driven by differences 
in the study-specific regression models, and not in the distributions of the 
predictors, then one expects the two approaches to infer similar partitions. 
Instead, the total variation distance between posterior distributions is 0.79, 
and we did not notice similarities. This is consistent with the interpretation 
of the partition inferred through the validation analysis that we discussed 
in the previous paragraph. 

We also considered random survival forests for constructing prediction 
models; we used methodology and software discussed in Ishwaran et al. 
(2008). This method directly provides mortality scores for each sample in 
a test data set. Under several choices of the tuning parameters involved in 
the application of random forests, including minimal final nodes sizes [see 
Ishwaran et al. (2008) for details], the resulting predictive models appeared 
inferior to ridge regression when compared using C T statistics. Under all 
considered choices of the tuning parameters at least 66% of the c ^ 

estimates were inferior to ridge regression. Contrasting results with random 
survival forests based on all available gene expression data versus the use 
of selected genes as suggested in Kang, D’Andrea and Kozono (2012), we 
obtained again higher estimates using all available gene expression 

data. 
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6. An example of heterogeneous studies. Next we discuss cross-study 
validation of four nonsmall cell lung cancer studies recently reviewed in 
Ferte et al. (2013), based on the data sets curated by the authors. The data 
structure is similar to the previous example and includes gene expression 
predictors and patient survival times. We refer to Ferte et al. (2013) for a 
detailed description. The four studies and corresponding samples sizes are 
as follows: the Director’s challenge study Shedden et al. (2008) (299), Zhu 
et al. (2010) study (62), Hou et al. (2010) study (79) and the TCGA [Ham¬ 
merman et al. (2012)] study (90). This example emphasizes the necessity 
of accounting for study heterogeneity and that averaging the Z SjV statistics 
does not provide a complete description of models’ performances. 

The Director’s study [Shedden et al. (2008)] includes data from 4 insti¬ 
tutions. Our first analysis investigates whether there are large differences 
in the data originating from these institutions. The posterior of the model 
assigns probability 0.83 to the event that these 4 data sets are clustered 
together. Then, we considered the hypothesis of clusters involving the re¬ 
maining data sets [Zhu et al. (2010), Hou et al. (2010), Hammerman et al. 
(2012)]. The approach is identical to the description in the previous section: 
we trained models using gene expression data and validated using concor¬ 
dance C T statistics. The posterior of n produced an estimate of two clusters, 
one including the Director’s study and the Zhu et al. study, and the second 
including the remaining two studies. The posterior strongly supports the 
hypothesis of separate clusters. The posterior probability at the estimated 
configuration II is 0.44, and 0.48 posterior probability accumulates on the 
neighborhood {n:£(n,II) = 1}. 

We finally compared sample size adjusted statistics Zl. v to interpret the 
clustering configuration. Figure 5 summarizes the main discrepancies visu¬ 
alized with these comparisons. On average, models fitted with 50 < j < 290 
samples from the Director’s study tend to achieve substantially higher vali¬ 
dation results when validated on the Zhu et al. study (blue line) than when 
validated on the remaining two studies (black lines). In the latter case the 
validation statistics decrease with training data set sample size, and the fit¬ 
ted models fail to predict survival times. We tested this difference using the 
bootstrap covariance estimates. The evaluation of prediction models pro¬ 
duced by the largest study [Shedden et al. (2008)] changes considerably if 
we only average the validation statistics across the three remaining studies, 
and it appears appropriate to report substantial discrepancies when we vali¬ 
date the Director’s study results with the Zhu et al. study, versus validation 
on the Hou et al. and TCGA studies. While this is beyond the scope of 
our analysis, the next step is to investigate in depth the reason for these 
discrepancies. 

7. Discussion. Despite the availability of large collections of related data 
sets in many areas of application, articles that evaluate statistical learning 
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Fig. 5. Sample size adjusted validation statistics for interpreting the clustering estimate 
II. The plot displays Z 3 S<V validation statistics when the model is trained by the largest 
study [Shedden et al. (2008)] and validated on the remaining studies (black and blue lines). 
Additionally, it displays validation statistics when we train on the Hou et al. study and 
use the TCGA study for validation (green line). Prediction models have been trained using 
ridge regression. 

algorithms based on a comprehensive analysis of available data sets remain 
a minority. Those using more than one data set are often based on cross- 
validation within each study due to heterogeneity between studies; see Jap- 
kowicz and Shah (2011), Demsar (2006) and Bernau et al. (2014) for discus¬ 
sions. Similar to meta-analyses for evidence synthesis, comprehensive model 
evaluations need to jointly consider study heterogeneity and algorithm per¬ 
formance. Here we propose a Bayesian approach to compare algorithms while 
incorporating relevant sources of uncertainty, including uncertainty on the 
comparability of independent studies. 

The basis for our framework is the leave-one-in array Z of validation 
statistics. The concept is applicable to any validation statistic, such as con¬ 
cordance indices, classification errors and distances between predicted and 
observed responses. While it is certainly possible, and very useful, to simply 
use the leave-one-in array as a visualization tool without further model¬ 
ing, our experience with evaluating genomic signatures in cancer suggests 
that modeling can substantially enhance interpretability of the leave-one-in 
analysis. Modeling addresses study heterogeneity, can prevent erroneous in¬ 
terpretations driven by sampling variability in the summary statistics, can 
help address multiplicity issues, and can formalize the process of identifying 
outlying studies requiring separate consideration. The analysis of the Z ar¬ 
ray helps interpreting the range of observed cross-study validation statistics, 
whether it is caused by differences in the study-specific distributions P s or 
it reflects sampling variability. 
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Our two-stage procedure is based on a single figure of merit Z : this choice 
is motivated by the need for a simple strategy and by the consideration that 
this still accomplishes the main goal to control sources of overoptimisnr such 
as over-fitting, selection of favorable training/testing combinations and the 
use of internal cross-validations when the studies at hand are heterogeneous. 
Use of a one-dimensional figure of merit can, however, be a limitation. For 
example, if two studies generated data of poor quality, perhaps because of 
errors during sample processing and data management, our algorithm would 
likely cluster them together, because they both fail to produce accurate 
predictions and generate similarly poor Z scores when used for validating 
candidate models. These two studies might still be different in important 
ways; for example, they may consider two different populations. From this 
perspective, additional summaries of the data and potentially additional 
analyses may be advisable to identify differences between studies. 

When multiple studies are available, a natural direction is to combine 
them. Bayesian hierarchical models, for instance, have emerged as a very 
useful paradigm to borrow information across studies [Lindley and Smith 
(1972) and Morris and Normand (1992)]. The leave-one-in analysis is not 
intended to replace combined analyses, but to address a different question: 
cross-study replicability of prediction. We consider the evaluation of predic¬ 
tion methods using leave-one-in matrices an important complementary goal 
and, in some cases, a prerequisite to the construction of predictive models 
based on multiple data sources. The analysis of leave-one-in matrices can be 
used not only to compare prediction methods, but also to select the most 
appropriate prediction methods for a subsequent combined analysis. In a re¬ 
lated application to ovarian cancer prognosis using gene expression profiles, 
we illustrate a case where we first use cross-study validation to quantify the 
extent to which existing prognostic algorithms can produce results that hold 
up across studies [Waldron et al. (2014)], and then proceed to develop new 
prognostic algorithms based on a combined analysis [Riester et al. (2014)]. 

One advantage of the leave-one-in approach is that it can be used to 
evaluate any prediction approach, including heuristic procedures for which 
it might be challenging to construct hierarchical extensions. The modeling 
complexity that comes with constructing joint models for multiple studies 
varies across fields. In some cases the algorithms are based on probabilistic 
models and multi-study extensions are possible. In others they are not, and 
might be based on heuristics or be very specific to the field of application. 
The complexity and problem-specific competence necessary for developing 
joint models for heterogeneous data sets are greater compared to the analysis 
of Z matrices for off-the-shelf methods. 

To address study heterogeneity, we cluster studies with similar validation 
profiles through a latent partition. The computation of the posterior distri¬ 
bution of the latent partition is straightforward and is a direct application 
of established computational strategies for fitting Dirichlet mixture mod- 
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els. We refer to the supplementary material [Trippa et al. (2015)] for more 
details. Clustering sharpens the interpretation of the cross-study validation 
results by allowing one to explore the maps B —> Zb, s , focusing on either 
the estimates B(s ) or on those partitions that a posteriori appear consistent 
with the dispersion estimate d and the observed array Z. 

A simple alternative to formal Bayesian clustering of data sets is a re¬ 
ordering of rows and columns of Z, by maximizing objective functions, to 
obtain high values of the validation statistics close to the matrix diagonal. 
While this is perhaps simpler than what we propose, it can be dangerous to 
interpret the Z s v validation summaries without consideration of the associ¬ 
ated sampling variability, and it is easy to introduce an optimistic bias with 
clusters obtained by optimizing intra-cluster validation statistics. 

In this article we only considered external validation statistics, where 
training and testing are performed on separate studies. Alternatively, one 
could integrate internal cross-validation into our framework by adding a di¬ 
agonal to the Z array, with entries consisting of within study cross-validation 
statistics. A drawback of standard cross-validation techniques in this con¬ 
text is that they may result in overly optimistic assessments [Bernau et al. 
(2014)]. 

In this work we compared learning algorithms by separate analyses of the 
resulting Z arrays, but a natural extension is the joint analysis of multiple Z 
arrays corresponding to competing algorithms. A similar discussion applies 
to consideration of multiple validation statistics at the same time. A separate 
refinement could seek a data-driven approach for selecting the thresholds 
described in Section 5.1 to correct for sample size differences across studies. 

SUPPLEMENTARY MATERIAL 

Supplement to “Bayesian nonparametric cross-study validation of predic¬ 
tion methods” (DOI: 10.1214/14-AOAS798SUPP; .pdf). We discuss results 
for logistic regression, Poisson regression, proportional hazards models and 
support vector machine procedures in the supplementary material. 
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